Automatic Classification of Unstructured Blog Text
Authors
Abstract
Automatic classification of blog entries is generally treated as a semi-supervised machine learning task in which each entry is automatically assigned to one of a set of pre-defined classes based on features extracted from its textual content. This paper classifies unstructured blog entries in several steps: pre-processing (tokenization, stop-word elimination, and stemming), statistical feature-set extraction, feature-set enhancement using semantic resources, and modeling with one of two alternative machine learning models, the naïve Bayesian model or the artificial neural network (ANN) model. Empirical evaluations indicate that this multi-step approach achieves good overall classification accuracy on unstructured blog text datasets with both models. However, the naïve Bayesian model clearly outperforms the ANN-based model when only a smaller feature set is available, which is usually the case when a blog topic is recent and the amount of available training data is restricted.
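The pre-processing and naïve Bayesian steps named in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the stop-word list, the suffix-stripping `stem` function (a crude stand-in for a real stemmer), the `NaiveBayes` class, and the toy training entries are all assumptions made for this example. The semantic feature-set enhancement and the ANN model are not shown.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "and",
              "to", "in", "on", "for", "this", "that"}


def stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer, not Porter's algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


class NaiveBayes:
    """Multinomial naive Bayes over term counts with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.term_counts = defaultdict(Counter)   # term frequencies per class
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(preprocess(doc))
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, doc):
        # Terms never seen in training carry no evidence, so they are ignored.
        tokens = [t for t in preprocess(doc) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)  # smoothed log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy blog entries, invented purely for illustration.
train_docs = [
    "The team scored two goals in the final match",
    "Our striker was injured during training",
    "The new phone ships with a faster processor",
    "Developers released a software update for the laptop",
]
train_labels = ["sports", "sports", "tech", "tech"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("The striker scored in the match"))
print(model.predict("A faster processor and new software"))
```

Because the model only needs per-class term counts, it can be fitted on a handful of entries, which is the small-feature-set setting the abstract refers to.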
Related resources
How Much Noise in Text is too Much: A Study in Automatic Document Classification
Noise is a stark reality in real-life data. Especially in the domain of text analytics it has a significant impact, as data cleaning forms a very large part (up to 80% of the time) of the data processing cycle. Noisy unstructured text is common in informal settings such as on-line chat, SMS, email, newsgroups and blogs, automatically transcribed text from speech data, and automatically recognized text f...
Coreference Resolution on Blogs and Commented News
We focus on automatic coreference resolution for blogs and news articles with user comments as part of a project on opinion mining. We aim to study the effect of the genre shift from edited structured newspaper text to unedited, unstructured blog data. We compare our coreference resolution system on three data sets: newspaper articles, mixed newspaper articles and reader comments, and blog data...
Sentiment Classification of Social Issues Using Contextual Valence Shifters
The growth of science and technology contributes to the growth of social websites and electronic media on a vast scale. Owing to developments in the field of information technology, information about anything is globally available on the internet, which is a great source of data and information. The data sets available on the internet are in unstructured form. To analyze the unstructured data, we need method ...
Automatic Text Classification: A Technical Review
Automatic Text Classification is a semi-supervised machine learning task that automatically assigns a given document to a set of pre-defined categories based on its textual content and extracted features. Automatic Text Classification has important applications in content management, contextual search, opinion mining, product review analysis, spam filtering and text sentiment mining. This paper...
Improving ontology-based text classification: An occupational health and security application
Information retrieval has been widely studied due to the growing amounts of textual information available electronically. Nowadays organizations and industries are facing the challenge of organizing, analyzing, and extracting knowledge from masses of unstructured information for decision-making processes. The development of automatic methods to produce usable structured information from unstructured te...